Exploring clade differentiation of the Faecalibacterium prausnitzii complex

Summary Faecalibacterium prausnitzii is one of the most prevalent and abundant polyphyletic health-promoting components of the human gut microbiome with a propensity for dysbiotic decreases. To better understand its biology in the human gut, we specifically explored the divergence pressures acting on F. prausnitzii clades on a global scale. Five F. prausnitzii clades were de novo identified from 55 publicly available genomes and 92 high-quality metagenome assembled genomes. Divergence rate indices were constructed and validated to compare the divergence rates among the different clades and between each of the diverging genes. For each clade we identified specific patterns of diverging functionalities, probably reflecting different ecological propensities, in term of inter-host dispersion capacity or exploitation of different substrates in the gut environment. Finally, we speculate that these differences may explain, at least in part, the observed differences in the overall divergence rates of F. prausnitzii clades in human populations.


INTRODUCTION
Faecalibacterium prausnitzii is one of the most wide-spread and abundant bacteria in the human gut microbiome (GM). It is probably an integral component of our evolutionary history which has populated our lineage for at least 1M years. 1 F. prausnitzii has been consistently reported as one of the main health-promoting components found in the intestine, 2 showing a crucial role in host nutrition and immunity, where it acts as an important anti-inflammatory commensal. 3 Indeed, recent studies 4-6 have shown that F. prausnitzii can attenuate the severity of inflammation through the release of a panel of anti-inflammatory metabolites, which enhance the intestinal barrier acting on tight junctions, as well as on peroxisome proliferator-activated receptor alpha (PPAR-a), PPAR-g and PPAR b/d genes. 7 Over the last few years an increasing number of studies have reported a depletion of F. prausnitzii in GMs associated with multiple diseases, enteric and non-enteric, [8][9][10][11][12] to the point that this bacterium has been proposed as a possible biomarker of dysbiotic shifts. This defines a complex scenario where, on the one hand, F. prausnitzii has a crucial role in maintaining gut homeostasis, but on the other hand it is extremely prone to dysbiotic reductions. However, at present, it still remains elusive which biotic and abiotic factors regulate its presence in the gut, the extent of their influence and the mechanisms involved in its retention.
First 16S rRNA gene-based phylogenetic analyses showed that at least two different F. prausnitzii phylogroups can be found in the human GM, whose distribution is different between healthy subjects and patients with gut disorders. 13,14 Most recently, the polytypic nature of F. prausnitzii has been confirmed, detecting up to 11 different clades, which show a different prevalence and/or abundance in the human GM depending on age, geographical origin and lifestyle. 15 These authors also confirmed the depletion of this species in inflammatory bowel disease and obesity. Although these findings certainly represent a milestone for a better understanding of F. prausnitzii biology in the human gut, there is still no evidence concerning possible selective pressures driving for the observed clades divergences, and it has not yet been investigated why such clades exhibited a markedly different distribution in the human population.
In an attempt to answer these questions, here we explored the dynamics involved in the divergence processes of the clade-specific marker genes in the F. prausnitzii complex, dissecting the peculiarities of each clade and providing some glimpses on the putative pressures selectively acting on each of them. Specifically, we reconstructed high-quality F. prausnitzii genomes from metagenomes (MAGs) starting from $750 healthy human gut metagenomes 16-22 and identified F. prausnitzii clades by implementing a We next assessed the distribution of the 5 clades in the human population (see STAR Methods). According to our data, the 5 clades we identified are distributed across the entire set of human populations considered, thus all the clades can be regarded as cosmopolitan ( Figure S4A). To investigate if these clades were mutually exclusive or able to co-inhabit the bowel, we evaluated the co-presence within the same metagenomic sample. This analysis clearly revealed that the degree of co-presence is variable in the human population, with some subjects harboring all the clades, whilst others not harboring F. prausnitzii at all. In particular, we observed that the presence/copresence of the F. prausnitzii clades was associated with age, geographical origin and subsistence strategy ( Figures S4B-S4D), confirming what previously highlighted in another study (De Filippis et al., 2020 15 ). Indeed, F. prausnitzii was almost always present in adults (96% contained at least 1 F prausnitzii clade, 18-69 years old), but the prevalence considerably decrease in infant (29%, <1 years old, Fisher's test p<0.01), and centenarians (40%, >99 years old, p<0.01). Lower prevalence was also detected in children (89%, 1-16 years old, p<0.01) and elderly people (89%, 70-97 years old, p<0.01). Finally, the intra-individual clades diversity was highly variable according to the geographical area and the related lifestyle, with higher levels in non-Western countries (e.g., Tanzania, Wilcoxon test p=0.0001), respect to Western countries, that showed a progressively lower prevalence for all clades from Europe to Japan through North America.

Construction and validation of divergence indices
To account for the rate of divergence between the F. prausnitzii clades, we developed two Divergence Rate Indices (DRIs), one at the clade level and the other at the gene level. The clade-level DRI (DRIc), was specifically conceived to account for the overall divergence rate of each clade and was computed as the natural logarithm (ln) of the ratio between the median number of single nucleotide polymorphisms (SNPs) of the whole set of clade-specific genes (M E ) -defined as genes present in at least the 95% of the genomes from a given clade and absent in the other clades -and the median SNPs for a basal set of housekeeping reference genes (M H ), i.e., genes showing a little divergence within a clade. The housekeeping references was constituted by a panel of 10 genes (recA, rplS, rplI, purN, mreB, maf, fmt, gyrB, rpoB, proC) (Table S3), comprising essential genes we found present in all the genomes and MAGs we analyzed. On the other hand, the gene-level DRI (DRIg) was created to account for the absolute divergence rate for a single clade-specific gene for a given clade and was Further, to fully assess divergence pressures, the aforementioned indices were implemented by considering the ratio of non-synonymous to synonymous substitutions (dN/dS). 27 Consistently, 2 non-synonymous divergence rate indices (NDRIs) were developed: (1) The clade-level NDRI (NDRIc), which considers the ln of the ratio between the mean dN/dS values for the whole set of clade-specific genes (m E ) and the mean dN/dS for the basal set of housekeeping reference genes (m H ), (2) the gene-level NDRI (NDRIg), as ln of the ratio between the dN/dS value of the selected clade-specific gene (m G ) and m H .
Generally, for a given clade, a positive value for the DRIc and NDRIc indices points out that the corresponding set of clade-specific genes are accumulating SNPs and non-synonymous SNPs faster than iScience Article housekeeping references; the higher the index values, the greater the divergence rate for the specific clade. Analogously, for a given clade-specific gene, a positive value for the DRIg and NDRIg indices indicates that the gene is accumulating SNPs and non-synonymous SNPs faster than housekeeping references; the higher the index values, the greater the divergence rate for the given clade-specific gene.
Because the DRIg and NDRIg indices were first necessarily computed at the level of the single metagenomes, to be then extrapolated at the population and metapopulation levels, and to verify any bias due to the sequencing yields, for each clade we performed Pearson's correlation tests between M H and m H values and metagenome lengths and the computed F. prausnitzii abundances. Correlations were also sought between gene prevalence and DRIg/NDRIg indices, to assess the presence of biases due to sequencing coverage on specific genes. According to our findings, no significant correlations were found (p>0.05).
Divergence dynamics: Each clade shows a distinctive profile This confirms that clade-specific marker genes are globally accountable for the divergence of the clades; hence, investigating their function may provide new glimpses over the selective pressure driving clades divergence. In particular, clade D showed the highest NDRIc values -with relatively high values for the DRIc index as well -resulting in the most rapidly diverging clade in the human population.
Next, to highlight for each clade the most diverging clade-specific genes, the clade-specific patterns of DRIg and NDRIg indices were computed ( Figure 3 and Table S4). For each clade, gene-level divergence indices showed positive values for a multitude of clade-specific genes, indicating an overwhelming iScience Article divergence rate in the human population that far exceeds that characteristic of housekeeping genes, as representative of a basal divergence.
For each clade, marker genes were then filtered, keeping only those with both global DRIg and NDRIg positive values. We interpreted the combination of higher mutation rates and more impactful mutations as a signature of active divergence of those regions, therefore investigating the function of such sequences may provide new glimpses over the selective pressures acting globally on F. prausnitzii.

Clade-specific marker genes show genetic signatures of purifying selection and selective sweeps
To confirm that clade-specific marker genes are evolving under a non-neutral process, we added Tajima's D 29 to our approach. This parameter allows one to identify sequences that do not fit the neutral theory model at equilibrium between mutation and genetic drift. Computing Tajima's D for F. prausnitzii on synonymous sites, to reduce the effects of selection, we observed negative values for all 5 clades (mean À1.6), with clade A showing the lowest value (À2.1) and clade D the highest (À1.3). Looking at the single gene contributions, we found that clade-specific marker genes contributed more to the negative values than core genes, indicating strong level of purifying selection with an excess of rare polymorphisms ( Figure 4). Also, together with the evidence from our indices, these estimates suggest that the higher values of the dN/dS ratio of the marker genes are probably caused by recent mutations, capturing a current selection still in progress, acting immediately after or in a context of selective sweeps.

Different clades show different functions of the clade-specific marker genes under divergence pressure
To investigate the function of clade-specific marker genes filtered according to the combination of DRIg and NDRg indices, KEGG Orthology 30 was used, allowing one to take into account the possible functional redundancy among the different markers. Thus, for each clade, we were able to obtain a profile of KOs corresponding to the most diverging clade-specific genes, i.e., those showing positive DRIg and NDRIg values. As expected, several KOs were specific to each single clade, whilst others were shared by two to four clades. No common functions to all clades were identified ( Figure 5 and Table S5).
In particular, clade A showed 64 distinctive KOs, including many genes related to sporulation, DNA repair, microbial resistance mechanisms (e.g., antibiotic biosynthesis, xenobiotic degradation, CRISPR proteins,  iScience Article CAMP-resistance) and several transporters and transcription factors. As for clade B, we identified 43 unique KOs, mainly concerning the two-component system, antibiotic resistance genes, membrane transporters, as well as DNA repair and one carbon pool by folate. Clade C presented 39 selective KOs involved in DNA repair, sporulation, antimicrobial resistance, beta-lactam resistance, xenobiotic degradation, as well as several efflux proteins, transcription factors, genes involved in tRNA biogenesis, ribosome biogenesis and aminoacyl-tRNA biosynthesis. In addition to these functions, Clade C was the only clade that showed the anti-inflammatory MAM (microbiota anti-inflammatory molecule) protein within the filtered marker genes. Clade D and clade E exhibited 11 specific KOs, with the first particularly enriched in inorganic ion transporters and functions related to amino acids metabolism and transport, and the second in carbohydrate and lipid transporters (Table S6).
Finally, for each clade, we explored the variation in clade-level divergence rates in different human populations. According to our findings, all clades, with the exception of clade C, showed a heterogeneous pattern of DRIc and NDRIc in the human populations considered ( Figure 6). In particular, quite opposite trends were found for clades A and D, with the former showing the highest divergence rates in hunter-gatherers and rural communities, and the latter diverging most actively in industrial urban populations.

DISCUSSION
Starting from previous evidences 12,14,15 that F. prausnitzii is a polytypic species, we performed a de novo clade identification process and then took a step forward to gauge possible determinants of clades divergence. In particular, by analyzing a panel of 92 F prausnitzii MAGs assembled from 740 human metagenomes and 55 available genomes from NCBI, we were able to define 5 distinct clades of the F. prausnitzii complex, on which we based our further research. Four divergence rate indices (DRIc, NDRIc, DRIg and NDRIg) were constructed and validated, which, combined with Tajima's D estimation, allowed for a curated assessment of the non-neutral divergence rate of each clade down to the gene level. Of interest, the exploitation of gene-level indices to identify the most rapidly diverging clade-specific marker genes allowed us to dissect the signatures of the possible selective pressures acting over these clades. In particular, for the clades A, B and C, the most rapidly diverging genes corresponded to functionalities that may allow to better cope with environmental changes, as well as to increase the inter-host dispersion capacity. Indeed, clade A was found to rapidly diverge in genes involved in several stages of the sporulation process, DNA repair and microbial resistance mechanisms, all of which are important factors for a prokaryotic cell to withstand and counteract environmental stresses. Similarly, clade B revealed a propensity to diverge functionalities related to the two-component system, mRNA expression regulation, aminoacyl-tRNA biosynthesis, transporters and membrane proteins, which may allow for a better metabolic flexibility in response to environmental stimuli. Finally, clade C combined a certain resistance potential, attributable to DNA repair, sporulation and resistance genes, with functional adaptability, as evidenced by several genes encoding transcription factors and transporters, and involved in the expression regulation. As a distinctive feature of clade C, the most rapidly diverging clade-specific genes also included functions related to the modulation of the immune response, such as the MAM protein, which has been shown to exert anti-inflammatory activities primarily via NF-kB pathway inibition. 31 In contrast, clade D and E showed a different pattern of iScience Article diverging functionalities, probably related to the exploitation of a copiotrophic gut environment, as indicated by the distinctive presence of genes coding for amino acid transport and metabolism, as well as carbohydrates and lipids transporters.
In our study, we found differences in the clade-level divergence rates between different human populations. In particular, clade A is diverging faster in hunter-gathering and rural populations, whereas clade D showed an opposite trend. Taken together, these observations might suggest that F. prausnitzii clades -or at least some of them -are evolving characteristic functional specializations that are better suited to the context of a specific host subsistence strategy which, in turn, would favor a more rapid divergence rate. For instance, clade Awhich is evolving functionalities to survive outside host -showed a better fit in traditional populations, where inter-host dispersion of GM components is still an active process as it is not compromised by the sanitization practices typical of Western populations. 32,33 Conversely, clade D, which is evolving adaptations for efficient exploitation of different substrates within the gut environment, showed a better fit and faster adaptive evolution in industrial urban populations, who are well known to consume high-fat/high-protein diets enriched in simple carbohydrates. 34 Future studies including the isolation and cultivation of different F. prausnitzii strains representing each clade should be crucial to better identify the specific selective pressures driving clade differentiation.
Overall, our findings may provide new insights into the possible factors driving to the differentiation of the observed subspecies groups in the F. prausnitzii taxon. This information may be helpful for better understanding the evolutionary propensity of this health-promoting GM component allowing, from our side, to adopt sustainable dietary and lifestyle practices to favor its retention in the human gut. This is particularly important for industrial urban populations, where a decrease in F. prausnitzii diversity and prevalence has been observed. 15 Possibly, the excess of sanitization practices typical of these populations is just facilitating the reduction of the F. prausnitzii clades A-C, which are evolving for better outside-host survival as a strategic factor for increasing their colonization of the human population.
Finally, the procedure we developed and implemented in this work can be virtually applied to every polytypic species of bacteria and, assuming the use of a sufficient number of genomes and metagenomes, could provide new ecological insights over the evolutionary forces shaping the microbial world around and within us.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Lead contact
Further information and request for resources and reagents should be directed to and will be fulfilled by the lead contact, Simone Rampelli (simone.rampelli@unibo.it).

Materials availability
This study did not generate new unique reagents.
Data and code availability d All human gut metagenomic sequences used in this study are available in public repositories (see Table S1 for accession numbers).
d This paper does not report original code.
d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request

Human metagenomes
Human metagenome datasets used in this study are from 7 previously published studies, are available in public repositories (see Table S1 for accession numbers), and included 747 subjects spanning different countries (North America, Peru, Sweden, Germany, Italy, Tanzania and Japan) and lifestyles (industrial urban populations, hunter-gatherers and rural communities).

METHOD DETAILS
Constructing a F. prausnitzii genome panel with additional curated genomes from metagenomes A panel of 147 F. prausnitzii genomes comprising the entire set of available genomes through the NCBI RefSeq Genome repository (55 genomes, https://www.ncbi.nlm.nih.gov/refseq), and 92 manually curated MAGs (see the paragraph below ''metagenomic assembly to MAGs'') were collected for performing the analysis. Metagenomic samples from reference studies 16-22 were downloaded via Sequences Read Archive (SRA). 66 We included gut microbiome samples from individuals from different geographical regions and lifestyles for taking into consideration different aspects of gut microbiome variation. In particular, considered regions were: North America (urbans), Peru (rural inhabitants and hunter-gatherers), Sweden (urbans), Germany (urbans), Italy (urbans), Tanzania (hunter-gatherers) and Japan (urbans). Sequences were qualitychecked with FastQC v.0.11.8 35 and filtered for human reads using KneadData v0.7.2, 36 in case of singleend reads, and the MetaWRAP command ''read_qc'' (v1.0.2) 37 for paired-end reads. The panel was complemented with further 11 genomes from species of the Ruminococcaceae family that are considered as outgroup for clade definition in the subsequent analyses. Accession numbers of the F. prausnitzii NCBI genomes, metagenomic samples and OS reference genomes included in the study are provided in Table S1.

Metagenomic assembly to MAGs
To profile the microbial community composition contained in each quality-filtered sample, shotgun metagenomic sequencing data were analysed with MetaPhlAn2. 38 Reads from samples containing at least 1% F. prausnitzii were assembled using MegaHIT. 39 The minimum contig length considered for further analyses was set by default to 1kb. MetaBAT 2 40 and MaxBin 2 41 algorithms were used for the binning procedure, followed by quality analysis with CheckM. 42 Only genome bins with >95% bin completeness and <5% bin contamination were retained and taxonomically classified using PhyloPhlAn 3.0 43 (databaseSGB.Dec19) and MetaWRAP with the NCBI nucleotide and taxonomy databases. 67 Ninety-two high-quality MAGs classified at species level for F. prausnitzii were included within the genome panel.
Average nucleotide and genetic distances within the F. prausnitzii complex and between the complex and related species The average nucleotide identity (ANI) pairwise distances were computed using pyani (version 0.2.6; option '-m ANIb') 44  iScience Article species of the Ruminococcaceae family included in our panel. Percentage identity was converted into a distance measure, and distances scores were filtered to include only the pairwise comparisons where alignment lengths exceeded 500,000 bp.
The pairwise genetic distances between the same genomes compared above were calculated using a pipeline that included Prokka, 45 ROARY 28 and the package ''vegan'' of the R software. 68,69 In brief, each genome was first analysed by Prokka with the '-fast' flag, to identify open reading frames. 70 The core genome alignments were produced utilizing PRANK 46 included within the ROARY pipeline. For this step we set the minimum percentage identity for gene clustering to 90% and the minimum required presence for defining core genes to 90% of genomes. The pangenome information obtained, comprising a binary table with gene presence/absence, was used for building a genome-based Jaccard dissimilarity pairwise distance matrix in R using the ''vegdist'' command. 68 Clades were finally defined by hierarchical Ward-linkage clustering using both distance matrices. Permutational multivariate analysis of variance was used to verify whether the clades were significantly different from each other in terms of ANI and gene contents (FDR< 0.001).
Phylogenetic analysis of the F. prausnitzii genomes included in the genome panel A phylogenetic tree was built using the genome panel and PhyloPhlAn 2 26 . The configuration file was customized as by Tett at al., 71 using Diamond v0.9.9.110 47 for the mapping step, MAFFT v7.310 48 for the multiple sequence alignment, trimAl version 1.2rev59 49 for trimming, FastTree v2.1.10 50 for the first tree generated and RAxML v8.1.15 51 for the final tree. In addition to the customized configuration file, the parameters used were '-diversity low -fast'.

Identification of clade-specific marker genes and abundance analysis
Marker genes for each clade were identified by analysing the F. prausnitzii pangenome obtained with the Prokka and ROARY pipelines (see the ''average nucleotide and genetic distances within the F. prausnitzii complex and between the complex and related species'' paragraph above for further information). In particular, we defined as ''marker genes'' for a given clade, the genes present in at least 95% of the genomes of that specific clade and completely absent in all the others (see Table S7 for the number of marker genes identified for each clade). Nucleotide sequences for each pool of marker genes were used for building clade-specific databases with bowtie2-build. 55 To determine if a given clade was present in a metagenomic sample, the reads were mapped to the clade-specific markers using Bowtie2 55 and then processed to evaluate the marker genes coverage. 72 A marker was scored present if it had R0.5X coverage and a clade present if at least 50% of its clade-specific markers were hit. Finally, clade relative abundances for each metagenomic sample were calculated as the mean clade marker coverage multiplied by the F. prausnitzii genome size (bp) and divided by the metagenome size (bp).

Functional annotation
The functional annotation step was performed using the EggNOG mapper (version 1.0.3) 52 on the protein sequences identified by Prokka with the '-d bact'database option. The KEGG Brite Hierarchy was used to screen the EggNOG annotations. Fisher's exact test with Bonferroni's correction was used to identify significant differences (p<0.01) in gene content between clades.
We also sought for differences in the level of CAZymes. 73 Gene sequences were identified with HMMSEARCH 53 against the dbCAN HMMs v6 database, 74 using default parameters and applying postprocessing stringency cut-offs as suggested by the authors (if alignment length >80 AA, E-value is filtered for values < 1e-5, otherwise for values < 1e-3; then a cut-off is applied based on the covered fraction of HMM >0.3). 74 Only CAZy families that were significantly different in at least one clade (Bonferroni-corrected Fisher's exact test, p<0.01) were retained and graphically represented using the R package ''gplots''. 75 Finally, the genes encoding the MAM protein of F. prausnitzii were detected by aligning the protein sequence 31 against the full set of genes from the F. prausnitzii pangenome using protein-protein BLAST (v2.2.31+). 54 For a complete list of marker genes with annotated function, refer to